This research uses inteins, a type of mobile genetic element, to infer patterns of 
gene transfer within the Halobacteria. We surveyed 118 genomes representing 26 
genera of Halobacteria for intein sequences. We then used the presence-absence 
profile, sequence similarity and phylogenies from the inteins recovered to explore how 
intein distribution can provide insight on the dynamics of gene flow between closely 
related and divergent organisms. We identified 24 proteins in the Halobacteria that 
have been invaded by inteins at some point in their evolutionary history, including 
two proteins not previously reported to contain an intein. Furthermore, the size of 
an intein is used as a heuristic for the phase of the intein's life cycle. Larger size 
inteins are assumed to be the canonical two domain inteins, consisting of self-splicing 
and homing endonuclease domains (HEN); smaller sizes are assumed to have lost the 
HEN domain. For many halobacterial groups the consensus phylogenetic signal derived 
from intein sequences is compatible with vertical inheritance or with a strong gene 
transfer bias creating these clusters. Regardless, the coexistence of intein-free and 
intein-containing alleles reveals ongoing transfer and loss of inteins within these groups. 
Inteins were frequently shared with other Euryarchaeota and, among the Bacteria, with 
members of the Cyanobacteria (Cyanothece, Anabaena), Bacteroidetes (Salinibacter), 
Betaproteobacteria (Delftia, Acidovorax), Firmicutes (Halanaerobium), Actinobacteria 
(Longispora), and the Deinococcus-Thermus group. 



Keywords: gene symbiosis, genome as an ecosystem, inteins, mobile genetic elements, gene flow, horizontal gene 
INTRODUCTION 

Inteins are self-splicing genetic parasites located in highly conserved 
sites of slowly evolving genes. They are found in all three domains of 
life and in viruses (Perler et al., 1997; Pietrokovski, 2001; Gogarten 
et al., 2002; Swithers et al., 2009). Similar to group I introns, inteins 
are often associated with a homing endonuclease (HEN). An important 
difference between inteins and introns is the timing of the splicing 
activity, which occurs immediately after transcription in introns and 
after translation in inteins (Hirata et al., 1990; Kane et al., 1990). 
The association with a HEN domain enables a cyclic invasion pattern, 
called the homing cycle (Goddard and Burt, 1999; Gogarten and Hilario, 
2006). The homing cycle consists of three phases: intein invasion, 
intein fixation, and eventually loss of the intein, which enables 
invasion to occur again. During invasion and fixation the intein 
splicing domains are associated with a HEN domain, forming a canonical 
intein (hereafter referred to as a large intein); however, during the 
loss phase the HEN often loses function and begins to degrade, 
generating a mini-intein. Simulations have shown that intein-containing 
and intein-free alleles can coexist in well mixed populations under 
some sets of parameters (Yahara et al., 2009; Barzel et al., 2011). 
Also, inteins with functioning HEN domains were inferred to have 
persisted in some eukaryotic lineages for several hundred million years 
(Butler et al., 2006; Gogarten and Hilario, 2006). 



Inteins do not have an apparatus to penetrate the cell envelope. 
Therefore, they must rely on mechanisms already in place within the 
population for entry into the cell, such as conjugation, mating, 
generalized DNA uptake, and viruses or gene transfer agents (Lang et 
al., 2012). The faster-than-Mendelian inheritance of the large inteins 
(Gimble and Thorner, 1992), along with a nearly neutral fitness burden, 
enables these mobile elements to persist in organisms over evolutionary 
time as long as there are new populations to invade (Goddard and Burt, 
1999; Gogarten and Hilario, 2006). Furthermore, the size of the intein 
(mini or large) provides information about the genomic mobility of the 
element: mini-inteins are rarely integrated into the recipient's 
genome, whereas large inteins are more frequently integrated due to the 
activity of the HEN. The conservation of the recognition site provides 
an invasion target even in distantly related strains and species. Also, 
inteins have a higher substitution rate relative to their extein hosts 
(Swithers et al., 2013). This substitution rate gives rise to many 
evolutionarily informative sites when comparing a large collection of 
homologous inteins. In this work, we take advantage of these traits 
and survey the distribution of inteins in the Halobacteria, a 
highly recombinant class of halophilic Archaea (Williams et al., 
2012) known to contain several intein alleles (Perler, 2002). We 
make use of 118 halobacterial genomes (Supplementary Table 1) 
and the previously reported and newly discovered intein alle- 
les to survey networks of gene transfer within and outside the 



www.frontiersin.org 



June 2014 | Volume 5 | Article 299 | 1 



Soucy et al. 



Inteins as indicators of gene flow 



Halobacteria based on the presence-absence profile of the inteins, 
their sequence similarity, and the phylogenies reconstructed from 
intein sequences. 

MATERIALS AND METHODS 

HALOBACTERIAL INTEIN SEQUENCE RETRIEVAL AND ALIGNMENT 

Position specific scoring matrices (PSSMs) were created using the 
collection of all inteins from InBase, the Intein database and registry 
(Perler, 2002). A custom database was created with all inteins, and each 
intein was used as a seed to create a PSSM using the custom database. 
These PSSMs were then used as a seed for PSI-BLAST (Altschul, 1997) 
searches against each of the halobacterial genomes available from NCBI 
as of June 2013 as well as a private collection sequenced by our 
collaborators. To remove false positives, a size exclusion step was 
then performed on each protein sequence, as an intein domain adds 
100-700 aa to invaded protein sequences. Inteins were then aligned 
using Muscle (Edgar, 2004) with default parameters in the SeaView 
version 4.0 software package (Gouy et al., 2010). Insertions that 
passed the size exclusion step but did not contain splicing domains 
were removed, and the previous steps were repeated using the resulting 
dataset on a collection of private genomes from the Papke lab. ProtTest 
3.2 (Guindon et al., 2010; Darriba et al., 2011) was used to determine 
an appropriate substitution model for the intein sequences; the WAG 
model was favored and used for all subsequent trees for consistency. 
Once the collection of halobacterial inteins was complete, sequences 
were re-aligned using SATe (Liu et al., 2012) to generate a final 
alignment, using MAFFT (Katoh and Standley, 2013) to align, Muscle 
(Edgar, 2004) to merge, RAxML (Stamatakis, 2014) for tree estimation, 
and a WAG model for each allele. 
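
The size exclusion step described above can be illustrated with a minimal sketch. It assumes that the length of an intein-free homolog is known for each candidate protein; the function name, the reference length, and the example lengths are hypothetical and are not taken from the scripts used in this work.

```python
# Hypothetical helper illustrating the size exclusion step: a candidate is
# kept only if it is 100-700 aa longer than an intein-free homolog.
def passes_size_exclusion(hit_length, reference_length,
                          min_extra=100, max_extra=700):
    """Return True if the extra length is consistent with one intein."""
    extra = hit_length - reference_length
    return min_extra <= extra <= max_extra


# Made-up protein lengths (aa); the reference length is an assumption.
candidates = {"protA": 1250, "protB": 905, "protC": 1800}
reference_length = 900

kept = [name for name, length in candidates.items()
        if passes_size_exclusion(length, reference_length)]
print(kept)
```

Candidates that pass this filter would still need to be checked for splicing domains, as described in the text.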

To determine the relationship among all halobacterial inteins, the 
inteins were aligned using Muscle (Edgar, 2004). Subsequently, a tree 
was built using PhyML v3.0 (Guindon et al., 2010) with a WAG 
substitution model, and with the Gamma shape parameter and the 
proportion of invariant sites estimated from the data. 

INTEIN RETRIEVAL OUTSIDE THE HALOBACTERIA 

Each halobacterial intein was used as a BLAST (Altschul et al., 1990) 
query against the non-redundant database on NCBI. Any match with an 
e-value better than 0.000001 was aligned to the dataset to which its 
query belonged. Sequences were then filtered based on the protein 
annotation and goodness of fit to the existing alignment. As an 
additional filtering step, each match was used as a query against the 
non-redundant database, and the majority annotation among the BLAST 
hits was used to verify the protein identity, as annotations are not 
always reliable. Remaining sequences were aligned using Clustal Omega 
1.1.0 (Sievers et al., 2011) with the profile alignment option in 
SeaView 4.0 (Gouy et al., 2010). Maximum-likelihood trees were built 
using PhyML (Guindon et al., 2010) with the WAG model and rates 
estimated from the data. 
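
As an illustration of the e-value filter, the sketch below assumes the BLAST results were saved in the standard 12-column tabular format, in which the e-value is the 11th field. The function name and the example hit lines are hypothetical; the source does not state which output format was used.

```python
# Sketch of an e-value filter over BLAST tabular output (12 standard
# columns; e-value is the 11th). Hit lines below are made up.
def filter_hits(tabular_lines, max_evalue=1e-6):
    """Return (query, subject, evalue) for hits with evalue < max_evalue."""
    kept = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        evalue = float(fields[10])
        if evalue < max_evalue:
            kept.append((fields[0], fields[1], evalue))
    return kept


example = [
    "inteinQ\tsubj1\t62.5\t410\t150\t4\t1\t410\t200\t609\t1e-30\t250",
    "inteinQ\tsubj2\t31.0\t120\t80\t3\t5\t124\t10\t129\t0.002\t38.1",
]
hits = filter_hits(example)
print(hits)
```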

To assess the relative contribution of different genera repre- 
sented in each intein allele sequence data set, a stacked column 
graph was created. Sequence density was calculated for each intein 
allele by dividing the number of intein sequences in each genus by 
the number of total intein sequences in that allele. 
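
The sequence density calculation can be sketched as follows. This is a minimal example with made-up genus labels, not the procedure used to draw the published chart; the function name is hypothetical.

```python
from collections import Counter


def genus_density(genera):
    """Fraction of an allele's intein sequences contributed by each genus.

    `genera` holds one genus label per intein sequence in the allele.
    """
    counts = Counter(genera)
    total = sum(counts.values())
    return {genus: n / total for genus, n in counts.items()}


# Made-up example: four intein sequences recovered for one allele.
example = ["Halorubrum", "Halorubrum", "Haloferax", "Haloarcula"]
density = genus_density(example)
print(density)
```

The resulting fractions for one allele sum to 1 and correspond to the segment heights of one stacked column.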



SYMBIOTIC STATE ASSIGNMENT 

Intein sequence length was used to determine symbiotic state. 
For each intein allele the length of the intein sequence was deter- 
mined. A cutoff length for mini-intein assignment was based on 
the presence of a gap in intein lengths greater than 100 amino 
acids within an allele. The third intein state "no-intein" was 
assigned where the intein was clearly absent from the orthologous 
protein containing an intein in any of the halobacterial genomes 
examined. Additionally, once an intein was noted as a mini-intein 
the alignment was analyzed to ensure the gaps in these sequences 
correspond to the location of the HEN domain. 
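
A minimal sketch of the length-based state assignment is given below. It assumes that the cutoff is placed at the midpoint of the largest gap exceeding 100 aa between sorted intein lengths; the text specifies only that such a gap was used, so the midpoint rule, the function names, and the example lengths are illustrative assumptions. The check that gaps coincide with the HEN domain is not shown.

```python
def mini_intein_cutoff(lengths, min_gap=100):
    """Midpoint of the largest gap > min_gap in sorted lengths, else None."""
    ordered = sorted(set(lengths))
    best = None  # (gap size, midpoint)
    for short, long_ in zip(ordered, ordered[1:]):
        gap = long_ - short
        if gap > min_gap and (best is None or gap > best[0]):
            best = (gap, (short + long_) / 2)
    return None if best is None else best[1]


def assign_state(length, cutoff):
    """Assign one of the three states; length None means no intein."""
    if length is None:
        return "no-intein"
    if cutoff is not None and length < cutoff:
        return "mini-intein"
    return "large intein"


# Made-up intein lengths (aa) for one allele.
lengths = [165, 172, 180, 410, 455, 460]
cutoff = mini_intein_cutoff(lengths)
states = [assign_state(n, cutoff) for n in lengths + [None]]
print(cutoff, states)
```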

RIBOSOMAL PROTEIN REFERENCE TREE 

Alignments of 55 ribosomal proteins for 21 Halobacteria (Williams et 
al., 2012) were used to find orthologous proteins in the genomes used 
in this work. In-house Python scripts (data file 1) were used to 
concatenate the alignments, and PhyML v3.0 (Guindon et al., 2010) was 
used to build a tree. The tree used the WAG substitution model with the 
Gamma shape parameter, the proportion of invariant sites, and the base 
frequencies estimated from the data. 
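
The concatenation step can be sketched as below. This is not the in-house script from data file 1; it is a minimal illustration that assumes each alignment is held as a mapping from taxon name to aligned sequence, and that a taxon missing from an alignment is padded with gap characters.

```python
def concatenate(alignments, taxa):
    """Join per-protein alignments into one sequence per taxon.

    alignments: list of dicts mapping taxon -> aligned sequence
                (all sequences within one dict have equal length).
    taxa:       taxon names to include in the concatenation.
    """
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        width = len(next(iter(aln.values())))
        for taxon in taxa:
            # Assumption: a missing protein is represented by gaps.
            parts[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Made-up alignments for two proteins and two taxa.
aln1 = {"A": "MKV", "B": "MRV"}
aln2 = {"A": "LL-G"}
supermatrix = concatenate([aln1, aln2], ["A", "B"])
print(supermatrix)
```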

BAYESIAN CLUSTERING WITH INTEIN SEQUENCES 

A concatenation of an intein presence-absence matrix and alignments 
for each intein allele was generated using in-house Python 
scripts (data file 1). MrBayes version 3.2.1 (Ronquist et al., 2012) 
was then used to perform a clustering analysis using a partition 
allowing for character states in the presence-absence matrix and 
sequence information for each intein allele. The prior for the 
character portion of the data matrix used a symmetrical Dirichlet 
distribution with an exponential (1.0), and variable rates so each 
column was considered independent of the others. The likelihood 
for the character portion of the alignment used variable coding 
and 5 beta categories. The prior for the protein sequences in the 
alignment used a fixed WAG substitution model, with state fre- 
quencies estimated from the data, and the likelihood settings used 
a Gamma shape parameter and the proportion of invariant sites 
estimated from the data. 
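
The presence-absence portion of the data matrix can be illustrated with a short sketch. It assumes a simple binary coding (1 for intein present, 0 for absent) and does not handle missing data; the genome names and allele assignments are made up, and the function is not the in-house script from data file 1.

```python
def presence_absence(alleles, genome_to_alleles):
    """Encode each genome as a string of 1/0 characters, one per allele."""
    return {
        genome: "".join("1" if allele in found else "0" for allele in alleles)
        for genome, found in genome_to_alleles.items()
    }


# Made-up example with three intein alleles and two genomes.
alleles = ["cdc21-a", "pol-IIa", "polB-b"]
genomes = {
    "genome1": {"cdc21-a", "polB-b"},
    "genome2": {"pol-IIa"},
}
matrix = presence_absence(alleles, genomes)
print(matrix)
```

Each row could then be placed in its own partition ahead of the aligned intein sequences for the same genome.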

RESULTS 

HALOBACTERIAL INTEINS 

The intein content of a collection of halobacterial genomes 
was analyzed using an intein-allele-specific PSSM. This survey 
revealed 13 genes in the Halobacteria invaded by inteins at 
24 distinct positions (intein alleles) (Table 1). Seven of these 
intein alleles were not previously reported in the Halobacteria, 
and two of the seven have not previously been reported to 
harbor inteins: a DNA ligase gene involved in double strand 
break repair, and a deaminase gene involved in nucleotide 
metabolism (Table 1). To determine whether vertical inheritance could 
account for the distribution of intein alleles, the presence-absence 
matrix of intein alleles was mapped onto a reference phylogeny 
(Figure 1). Intein presence-absence is not concordant with the 
ribosomal protein phylogeny, implicating abundant horizontal genetic 
transfer (HGT) in creating the observed distribution. The presence of 
multiple intein alleles in the majority of genomes (70%) might be interpreted 
Table 1 | Exteins in the Halobacteria. 

Intein allele    Extein annotation 
cdc21-a          Cell division control protein 21 
cdc21-b 
cdc21-c 
polB-d           DNA polymerase B1 
polB-a 
polB-b 
polB-c 
pol-IIa          DNA polymerase II large subunit 
pol-IIb* 
dtd**            Deoxycytidine triphosphate deaminase 
gyrB             DNA gyrase subunit B 
helicase-b*      ATP-dependent helicase 
ligase**         ATP-dependent DNA ligase 1 
rfc-a            Replication factor C small subunit 
rfc-d* 
rir1-l*          Ribonucleoside-diphosphate reductase 
rir1-k 
rir1-b 
rir1-g 
rir1-m* 
rpolA            DNA-directed RNA polymerase subunit A 
udp              UDP-glucose 6-dehydrogenase 
topA             DNA topoisomerase I 
top6B            DNA topoisomerase VI subunit B 

*Denotes intein alleles discovered in this work. 
**Denotes extein sequences not previously reported to be invaded by an intein. 



to suggest that inteins could spread locally within a single 
genome. 

INTEIN PROPAGATION WITHIN THE HALOBACTERIA 

To address the possibility of inteins moving locally within a genome, 
the phylogenetic relationships among all halobacterial intein sequences 
were analyzed (Figure 2). All of the intein alleles form highly 
supported clusters with others of the same type, with the exception of 
two sequences: the polB-c inteins of Haloferax larsenii and Haloferax 
elongans group inside the polB-b intein allele cluster. However, this 
node is poorly supported (59/100 bootstraps), indicating that this 
relationship could be an artifact produced by poor resolution of the 
relationships that connect various intein alleles. Furthermore, there 
is poor support linking all of the intein allele clusters together 
(less than 70% bootstrap support), indicating that sequence conversion 
(an intein invading an ectopic or atypical locus) between intein 
alleles, even within the same host protein, is uncommon. Among the 
inteins analyzed here, at most one invasion of an ectopic site is 
supported by the data, confirming that this type of event is rare 
(Perler et al., 1997; Gogarten et al., 2002). These data indicate that 
HGT is the only plausible explanation for the large number of different 
intein alleles in this class of organisms. Incongruence between the 
presence of inteins and the ribosomal phylogeny also supports this 
conclusion. 

BAYESIAN PHYLOGENETIC ANALYSIS OF INTEINS 

In an attempt to resolve the local events (transfers and vertical 
inheritance within the Halobacteria) that gave rise to the observed 
intein distribution in the Halobacteria, a Bayesian analysis based on 
the intein sequences for each allele and on the presence-absence 
pattern was performed (Figure 3). In this analysis two organisms may 
group together because they both inherited inteins from a common 
ancestor, or because an intein was recently transferred between them. 
The paucity of well-supported nodes (nodes with 0.95 or greater 
posterior probability were considered well-supported) in part reflects 
the extent to which our sample is biased toward very similar sequences 
(31% of halobacterial genomes in this study are from Halorubrum). Most 
of the well-supported clusters in the Bayesian tree also occur in the 
reference tree, suggesting these inteins may be the result of shared 
vertical inheritance. However, many of these clusters do not have 
identical intein profiles (clusters 1, 6, 8, and 10); thus HGT between 
close relatives is a better explanation than vertical inheritance for 
these clusters. Only three of the clusters, 2, 9, and 12, have 
branching orders that are different from those observed in the 
reference tree, indicating HGT. Cluster 2 is made up of Natrinema 
pellirubrum and Natrinema versiforme, which share only the pol-IIa 
intein. In the reference tree Nnm. versiforme groups with the rest of 
the Natrinema, and Nnm. pellirubrum groups with Haloterrigena 
thermotolerans. Natrinema sp. J7-2 is the only other member of the 
Natrinema that has an intein in the pol-IIa position, but the intein in 
this species is 14 aa shorter than the intein shared by Nnm. 
pellirubrum and Nnm. versiforme. Htg. thermotolerans shares no inteins 
with Nnm. pellirubrum. Cluster 9 is made up of Halorubrum spp. C49 and 
E3, which share only the cdc21-b intein. In the reference tree Hrr. E3 
groups with Halorubrum litoreum and the two share the pol-IIa intein 
allele, but no others. Hrr. C49 groups with Halorubrum saccharovorum 
and they do not share any inteins. Cluster 12 is made up of Haloferax 
denitrificans, Hfx. lucentense, Hfx. alexandrinus, and Haloferax sp. 
BAB2207, which all have an intein in the cdc21-& position. In the 
reference tree Hfx. lucentense, Hfx. sp. BAB2207, and Hfx. alexandrinus 
all group together, but Hfx. denitrificans groups with Haloferax 
sulfurifontis, and they do not share any inteins. The lack of shared 
inteins between clusters in the reference tree and differences among 
the inteins shared in these clusters cause these divergences in this 
tree as compared to the reference tree. This may indicate that the 
taxa in the Bayesian clusters are exchange partners, or that they 
share unsampled intermediate exchange partners. Additionally, the 
majority of clusters share 2 or fewer intein alleles between all 
members of the cluster (eight out of 12 clusters). The two clusters 
that share the most intein alleles between all members are cluster 3, 
made up of Haloquadratum walsbyi strains DSM 16790 and C23 with 13 
shared intein alleles, and cluster 7, made up of Halorubrum spp. 
strains SP3 and SP9 sharing 4 intein alleles. Both of these clusters 
have branching patterns identical to those on the reference tree, 
indicating that phylogenetic proximity plays a significant role in 
intein distribution. 

Members of the Halorubrum genus, not surprisingly, were 
highly represented in the clusters (four of 12 total). All four of 
the clusters show a geographic bias. Clusters 6, 8, and 9 were all 
isolated from the Aran-Bidgol lake in Iran, and cluster 7 was iso- 
lated from the Sedom Ponds in Israel (Atanasova et al., 2012). 
Branch lengths in all of these clusters are very small, suggesting 
these populations are well mixed with respect to intein sequences. 
Geography does not seem to play a strong role in linking other 
FIGURE 1 | Intein invasion pattern in the Halobacteria. The intein 
presence-absence pattern is mapped onto the tips of a ribosomal 
reference tree. Teal boxes indicate the presence of a full size intein, 
yellow boxes indicate the presence of a mini-intein, black boxes 
indicate the absence of an intein, and white boxes indicate missing 
data. Purple shaded boxes indicate the genera with more than five 
species represented on the tree. Nodes with bootstrap support <70 
are in gray. 
FIGURE 2 | Relationships among intein alleles in the Halobacteria. This 
tree depicts the phylogenetic relationships among intein alleles in the 
Halobacteria. Inteins that clustered in concordance with the allele were 
collapsed to a single node and labeled with the name of the intein allele. 
Two polB-d sequences did not group with the polB-A allele and are instead 
located amongst the polB-c alleles; these are indicated in red. Nodes with 
bootstrap support <70 are colored gray. 



well-supported clusters based on intein sequences. Furthermore, 
evidence of clustering based on geography in the Halorubrum is 
less interesting than the clear separation between groups isolated 
from the same location (clusters 6, 8, and 9). This separation of 
species of Halorubrum from the same location is echoed in the 
reference tree and, taken together with the short branch lengths 
in these clusters, indicates that population structure plays a strong 
role in gene sharing, at least for this location (see Fullmer et al., 
2014 for an in-depth discussion). Increased geographical sampling 
could reveal similar trends in other locations. 

INTEIN HOMING IN THE HALOBACTERIA 

The existence of a singleton in an intein allele in the genomes 
analyzed could represent intein invasion from outside the 
Halobacteria, but it could also be due to incomplete sampling. To 
FIGURE 3 | Clustering of Halobacteria based on intein sequences and 
distribution. Halobacteria were clustered based on intein sequences and 
the distribution in each genome. Clusters with posterior probability >95% 
are shaded purple. 



investigate the phylogenetic distance of invasion events responsible 
for the observed distribution of inteins, the halobacterial inteins 
were used as queries to search for homologous sequences in the 
non-redundant database (Altschul et al., 1990). Intein sequences that 
matched the alleles in the Halobacteria were found in other 
Euryarchaeota (but not Crenarchaeota), and Bacteria (Table 2). To 
ascertain whether homing occurred between the Halobacteria and 
organisms outside the Halobacteria, a maximum likelihood tree was 
built for each intein allele. The 






Table 2 | Taxonomic distribution in each intein allele. 



Intein 
allele 

cdc21-a 
cdc21-c 
dtd** 
gyrB 

helicase-b 

ligase 

pol-IIb* 

polB-d 

rfc-a 

rfc-d* 

rir1-b 

rir1-g 

rir1-k 

rir1-l* 

rpolA 

top6B 

topA 

udp 

rir1-m 

polB-c 

polB-a 

polB-b 

pol-IIa 

cdc21-b 



Tree topology Halobacteria Bacteria Other 

Euryarchaeota 



Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Monophyletic 

Polyphyletic- 

bacteria 

Polyphyletic- 

bacteria 

Polyphyletic- 

Euryarchaeota 

Polyphyletic- 

Euryarchaeota 



55 
1 
6 
6 
1 
1 
9 
6 
16 
5 
15 
4 
5 
3 
10 
8 
4 
7 
1 

20 
16 

38 

75 

51 



4 
0 
0 
19 
2 
0 
0 
0 
0 
0 

55 

15 

1 

3 

0 

0 

0 

2 

4 

1 

2 



16 

0 

0 

1 

1 

0 
1 

1 

13 

0 

5 

0 

0 

0 

0 

0 

1 

6 

0 

1 

1 

0 

16 

3 



*Denotes intein alleles discovered in this work. 
**Denotes exteins discovered in this work. 



tree topologies were evaluated with respect to the halobacterial 
inteins. If the halobacterial inteins in the tree were monophyletic, 
it was assumed that, except for the initial invasion, gene flow for 
that intein allele occurred within the Halobacteria exclusively. If 
the halobacterial inteins were polyphyletic, invasion events that 
generated the observed distribution likely involved organisms 
outside the Halobacteria either as donors or as recipients. The 
majority of intein trees, 83%, were monophyletic, reinforcing 
the idea that recombination is more successful between closely 
related organisms (Gogarten et al., 2002; Zhaxybayeva et al., 
2006; Andam et al., 2010; Papke and Gogarten, 2012; Williams 
et al., 2012). Interestingly, for trees where the Halobacteria were 
polyphyletic, the organisms interrupting the clade were Bacteria 
for two out of the four polyphyletic intein alleles. The sample 
size restricts building strong claims about HGT between the 
Halobacteria and the Bacteria. However, this claim is supported 
by previous evidence of gene exchange between the Bacteria and 
the Halobacteria (Ng et al., 2000; Khomyakova et al., 2011). 
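
The monophyly criterion used above can be illustrated with a small sketch. It assumes a rooted tree written as nested tuples with taxon names as leaves; maximum likelihood trees are unrooted, so a real assessment would first require a choice of root or a test on bipartitions. The taxon names and function names are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    names = set()
    for child in tree:
        names |= leaves(child)
    return names


def is_monophyletic(tree, group):
    """True if some clade in the rooted tree contains exactly `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Made-up trees: halobacterial taxa are Hrr1, Hrr2, and Hfx1.
tree_a = ((("Hrr1", "Hrr2"), "Hfx1"), ("Bac1", "Eur1"))
tree_b = ((("Hrr1", "Bac1"), "Hfx1"), "Eur1")

mono_a = is_monophyletic(tree_a, {"Hrr1", "Hrr2", "Hfx1"})
mono_b = is_monophyletic(tree_b, {"Hrr1", "Hfx1"})
print(mono_a, mono_b)
```

In the second tree a bacterial sequence interrupts the halobacterial clade, which corresponds to the polyphyletic case described in the text.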

The tight clustering of halobacterial intein sequences and short 
branches between closely related strains indicate that in the majority 
of cases inteins are inherited vertically or are transferred between 
closely related strains, and that successful invasion across large 
genetic distances is rare. Thus, intein alleles that are found in many 
different genera have been active for many generations, enabling 
invasion of many lineages, and accumulating examples of rare invasion 
events such as those that cross domain boundaries. Conversely, a lack 
of taxonomic diversity cannot be interpreted as a recent invasion, as 
sampling limitations could be responsible for the paucity of samples in 
that intein allele. While many factors influence the success of intein 
transfer between divergent organisms, the phylogenetic diversity of the 
organisms invaded by a particular intein allele is also a reflection of 
the time the intein allele has been present in a lineage. Furthermore, 
a high density of intein sequences in a particular domain or group of 
genera can be used to determine the most likely reservoir for the 
circulating intein allele. A stacked column chart was used to quantify 
the representation of each of the genera in each of the intein alleles 
(Figure 4). Five intein alleles, cdc21-b, pol-IIa, polB-b, cdc21&, and 
rfc-d, show polarity in intein density favoring the Halobacteria 
(specifically Halorubrum) as the reservoir for the intein population. 
This is not surprising, as the data indicate that the majority of 
intein transfer in the Halobacteria is within the class. Additionally, 
the diversity in five of the intein alleles, helicase-b, cdc21-a, gyrB, 
rir1-b, and udp, suggests these intein populations may be more ancient 
than the others in this study, as they have had time to accumulate 
rare, long distance transfers such that the diversity within them spans 
both class and domain boundaries. Interestingly, the helicase-b intein 
was only discovered in this study, though the diversity in the allele 
gives the impression that this intein has been around for a long time. 

TRANSFER OF INTEINS BETWEEN HALOBACTERIAL 
AND NON-HALOBACTERIAL LINEAGES 

Not all inteins are transferred equally; the efficiency of intein 
invasion is affected largely by the state of the intein. The HEN domain 
in canonical inteins is required to induce a double strand break and 
the subsequent homologous repair that results in invasion 
(Pietrokovski, 2001). Thus, mini-inteins that have lost a functioning 
HEN domain are mainly transferred vertically (they may be transferred 
horizontally together with the host gene). If an intein-containing 
allele has been fixed in a population, the intein can be removed from 
the population either by a precise deletion of the DNA encoding the 
mini-intein or by homologous replacement with an intein-free allele 
transferred from outside the population. Thus, mini-inteins are 
maintained through strong purifying selection, because any mutation 
that decreases the self-splicing activity decreases the availability 
of the host protein (Barzel et al., 2011). The intein states were 
determined to infer patterns of homing in the Halobacteria. The size of 
inteins in each allele, along with the position of gaps in the 
alignment relative to the HEN domain, was used as a heuristic for 
assigning mini-intein status. In most cases there was a clear 
separation in the distribution of intein lengths (at least 100 amino 
acids difference in length). The sizes of the more populated intein 
alleles within the three genera of the Halobacteria with the largest 
number of available genomes, Haloarcula, Haloferax, and Halorubrum, 
were recorded in a matrix of intein alleles (Figure 5). Many intein 
alleles show 
FIGURE 4 | Phylogenetic diversity in halobacterial intein alleles. A 
stacked column graph depicts the representation of the Halobacteria 
(in purple), the Bacteria (in blue), and other Euryarchaeota (in green). 
Intein alleles are ordered by the number of intein sequences 
recovered for each allele, which is reported in parentheses after the 
intein allele name on the x-axis. The number of genera for each 
intein allele is indicated by the number of breaks in the column 
(white lines), and the height of each of the fragments that make up 
a column indicates the proportion of sequences in that allele found 
in a particular genus. 
FIGURE 5 | Intein size distributions in the Haloferax, Haloarcula, 
and Halorubrum. The sizes of inteins in the Haloarcula (A), Haloferax 
(B), and Halorubrum (C) are indicated in the column corresponding 
to the intein allele. Mini-inteins are colored yellow, large inteins are 
colored teal, black boxes indicate no intein, and white boxes indicate 
missing data; clusters from Figure 3 are indicated by numbered 
orange boxes. The cdc21-a and b sequences for Halorubrum sp. 
J07HR59, though smaller than the rest, cannot be considered 
mini-inteins, as the intein sequences in these positions are not 
complete. 
a considerable size variation. This variability can be attributed 
to the accumulation of insertions and deletions in various lineages 
over time, which in some lineages leads to loss of the HEN domain. 
Notably, there is no variability in the size of intein sequences 
shared by the clusters recovered in the Bayesian analysis (orange 
boxes, Figure 5), reinforcing the claim of ongoing gene exchange in 
these clusters. 

Invasion from outside the Halobacteria is one explanation for the 
polyphyletic topology observed in some halobacterial intein alleles. 
To determine when these homing events could have occurred, the state 
of each intein was determined and mapped onto polyphyletic intein 
allele trees. The results of that analysis are summarized in Table 3, 
with mini-inteins indicated with a star (*), and inteins that group 
within the Halobacteria indicated by a tilde (~) next to the name of 
the organism. Many of the intein sequences (5 out of 11) from taxa 
outside the Halobacteria that interrupt the clade are large inteins, 
indicating that interactions between these taxa and the Halobacteria, 
though rare, are ongoing (Table 3). The assignment of the direction of 
transfers is extremely preliminary, as limited sampling can affect it; 
nevertheless, there are some cases with an overwhelming signal where 
the majority of sequences originate from the Halobacteria, or from the 
Bacteria in the case of rir1-m. The mixture of mini and large inteins 
represented in all of the intein alleles implies that most of these 
inteins are active in the Halobacteria, and notably involve a wide 
distribution of taxonomic exchange partners. 

DISCUSSION 

The importance of HGT throughout the tree of life demands 
the development of a system to monitor gene-flow within and 
between populations. This research provides fundamental evi- 
dence that mobile elements such as inteins can be used to 
uncover gene flow networks. Inteins have a unique combina- 
tion of traits that make them ideal tools to study evolution in 
microbial populations. They have a naturally wide phylogenetic 
distribution, enabling detection of HGT between distantly related 
taxa. This is demonstrated in this work by the intein trees where 
the Halobacteria were polyphyletic [pol-IIa, polB-a, polB-h, and 
cdc21h) indicating intein transfer between the Halobacteria and 
the taxa that interrupt them, as well as by data from other studies 
where intein transfer has been detected across phyla and domains 
(Butler et al, 2006; Swithers et al, 2013). Inteins also have a high 
substitution rate relative to their extein hosts, and a propensity 
for accumulating insertions and deletions, which makes detec- 
tion of transfers between close relatives (generally a difficult task) 
possible; for example, transfer within the Halorubrum clusters 
shown in Figure 3. Inteins can be associated with a HEN domain. 
If they are, they possess the ability to invade intein-free alleles 
following transfer; if they are not, they rely mainly on vertical 
inheritance together with the host gene, and the occasional trans- 
fer of the host gene. One intein allele, pol-IIa, is widely distributed 
in the Halobacteria and there are many examples of mini-intein 
sequences in this allele. These data suggest that invasion of this 
allele occurred early in the evolution of the Halobacteria, and that 
the intein may have been lost in some lineages, but retained as 
a mini intein in most of the genomes surveyed here. This could
also be true for the cdc21-a intein; however, the distribution is
not as diverse, and considerably fewer mini-inteins were detected. 
This is more suggestive of an intein that has been active in the 
Halobacteria for a long period of time, with the different intein 
states (empty target site, target site invaded by an intein with 
active HEN, target site occupied by an intein without functioning
HEN; Yahara et al., 2009; Barzel et al., 2011) existing and
co-existing in different halobacterial lineages. 
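
The three intein states listed above can be tallied from survey data. The following is a minimal Python sketch under stated assumptions, not the procedure used in this work: the 250-residue size cutoff, the function names, and the example lengths are all hypothetical and serve only to illustrate how intein size could be used as a proxy for the presence of a HEN domain.

```python
from collections import Counter
from typing import Optional

# Hypothetical cutoff (amino acids) separating mini-inteins from
# full-size inteins; an assumption for illustration only.
MINI_INTEIN_MAX_LENGTH = 250


def intein_state(length: Optional[int]) -> str:
    """Classify a target site by the length of the intein occupying it.

    None means that no intein was found at the target site.
    """
    if length is None:
        return "empty target site"
    if length <= MINI_INTEIN_MAX_LENGTH:
        return "mini-intein (HEN presumed lost)"
    return "large intein (HEN presumed present)"


def tally_states(lengths_by_genome: dict) -> Counter:
    """Count how many genomes fall into each intein state."""
    return Counter(intein_state(length) for length in lengths_by_genome.values())


# Made-up intein lengths for one allele in four genomes.
example = {"genome_A": None, "genome_B": 180, "genome_C": 455, "genome_D": 170}
print(tally_states(example))
```

A lineage in which all three states co-exist would be consistent with an intein that is still cycling, whereas a lineage holding only mini-inteins would be more consistent with an older invasion.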

The genomes analyzed in this work were cultured from salty 
water and soil samples around the world. The diverse background 
of the genomes may contribute to the spotty distribution of intein 
alleles (Figure 1). However, genomes isolated from the same location
show variation as well (Figure 3) (Fullmer et al., 2014),
reinforcing the notion that inteins are currently actively propa- 
gating in and being eliminated from halobacterial populations. 
Additionally, previous data have shown recombination occurs at 
a higher rate than mutation within the Halobacteria, and very 
little linkage between genes is detected in these genomes (Papke 
et al., 2004, 2007). These observations indicate that gene flow is an
important mechanism of niche adaptation in these organisms. In
Deep Lake, Antarctica, freezing temperatures limit the rate of
replication to approximately six times per year, and evolution in
the halobacterial populations there mainly occurs through gene 
flow (Demaere et al., 2013). Recent whole genome comparisons 
revealed frequent gene transfer followed by homologous replace- 
ment of the transferred gene within the Halobacteria, hampering 
attempts to resolve the phylogeny within this group (Williams 
et al., 2012). Gene flow and recombination between populations
and species make it difficult to resolve the species phylogeny 
among the different genera of Halobacteria (Papke et al., 2004).
The use of gene concatenation in building reference trees, as 
exemplified by the ribosomal protein reference tree used in this 
work, has been pivotal in determining a branching order for the 
major clades of organisms, such as the Halobacteria, that par- 
ticipate in a large amount of recombination with close relatives. 
However, because genetic transfer and homologous recombi- 
nation occur frequently between close relatives, the resulting 
phylogeny reflects both shared ancestry and the frequency of gene
transfer. Therefore, determining the network of gene flow that 
overlays the vertical signal is important to the understanding of 
the evolution of these organisms. Inteins cannot penetrate the 
cell wall, and thus capitalize on existing gene flow in populations 
to efficiently invade when the opportunity presents itself. This
trait can be exploited to keep track of successful homing events 
revealed by sequence similarity of inteins in distinct strains. 

Halorubrum was the only genus in this study that had a 
large enough sample size to begin to uncover a signal reflecting 
population structure. Many of the Halorubrum genomes in this 
study were isolated from the same location, and this collection 
of genomes showed a clear signal for a structured population. 
Sixteen genomes from Aran-Bidgol were separated into four well- 
supported clusters. Three of the four clusters have branching 
orders identical to those in the reference tree, and the support val- 
ues for those clusters could be attributed to both transfer within 
the group and a background phylogenetic signal or ancestral 
inheritance of similar intein alleles. However, only cluster 7 in 
the Halorubrum shares all intein alleles between all members of 

Table 3 | Protein sequence identifiers for intein sequences.

Intein allele | Species name | Accession number | Phylum
cdc21-a | Archaeoglobus profundus DSM 5631 | [illegible] | Euryarchaeota
| Archaeoglobus veneficus SNP6 | [illegible] | Euryarchaeota
| Candidatus Methanomassiliicoccus intestinalis Issoire-Mx1 | YP_008072558.1 | Euryarchaeota
| Crocosphaera watsonii | [illegible] | Cyanobacteria
| Ferroglobus placidus DSM 10642 | [illegible] | Euryarchaeota
| Halarchaeum acidiphilum | [illegible] | Halobacteria
| Lamprocystis purpurea | [illegible] | Gammaproteobacteria
| Methanomassiliicoccus luminyensis | [illegible] | Euryarchaeota
| Methanothermococcus okinawensis IH1 | [illegible] | Euryarchaeota
| Nocardia asteroides NBRC 15531 | [illegible] | Actinobacteria
| Nocardiopsis potens | [illegible] | Actinobacteria
| Pyrococcus abyssi GE5 | [illegible] | Euryarchaeota
| Pyrococcus furiosus DSM 3638 | [illegible] | Euryarchaeota
| Pyrococcus horikoshii OT3 | [illegible] | Euryarchaeota
| Pyrococcus sp. NA2 | [illegible] | Euryarchaeota
| Thermococcus litoralis DSM 5473 | [illegible] | Euryarchaeota
| Thermococcus onnurineus NA1 | YP_002306424.1 | Euryarchaeota
| Thermococcus sibiricus MM 739 | YP_002994932.1 | Euryarchaeota
| Thermococcus sp. AM4 | YP_002582218.1 | Euryarchaeota
| Thermococcus sp. CL1 | YP_006424652.1 | Euryarchaeota
| Thermococcus zilligii | WP_010479121.1 | Euryarchaeota
| Halorubrum sp. SP3 | KJ_865687.1 | Halobacteria
| Halorubrum sp. SP9 | KJ_865689.1 | Halobacteria
cdc21-b | Cyanothece sp. PCC 7822 | [illegible] | Cyanobacteria
| Halarchaeum acidiphilum | WP_020220725.1 | Halobacteria
| Candidatus Methanomassiliicoccus intestinalis Issoire-Mx1 | YP_008072558.1 | Euryarchaeota
| Methanomassiliicoccus luminyensis | [illegible] | Euryarchaeota
| ^Thermococcus barophilus | YP_004070279.1 | Euryarchaeota
| Halorubrum sp. SP3 | KJ_865687.1 | Halobacteria
| Halorubrum sp. SP7 | KJ_865688.1 | Halobacteria
| Halorubrum sp. SP9 | KJ_865689.1 | Halobacteria
polB-d | Archaeoglobus profundus DSM 5631 | YP_003400528.1 | Euryarchaeota
polB-a | ^Salinibacter ruber M8 | YP_003572085.1 | Bacteroidetes
| ^Salinibacter ruber DSM 13855 | YP_446104.1 | Bacteroidetes
| ^Halarchaeum acidiphilum | WP_020678478.1 | Halobacteria
| ^Methanoculleus bourgensis | [illegible] | Euryarchaeota
polB-b | Halosimplex carlsbadense | WP_006885382.1 | Halobacteria
| ^Salinibacter ruber M8 | YP_003572085.1 | Bacteroidetes
| ^Salinibacter ruber DSM 13855 | YP_446104.1 | Bacteroidetes
| ^Halanaerobium saccharolyticum | WP_005489097.1 | Firmicutes
| Halarchaeum acidiphilum | WP_020678478.1 | Halobacteria
polB-c | ^Thermus scotoductus | YP_004202875.1 | Deinococcus-Thermus
| Methanotorris igneus Kol 5 | [illegible] | Euryarchaeota
| Halorubrum sp. SP7 | KJ_865686.1 | Halobacteria
pol-IIa | Archaeoglobus veneficus SNP6 | YP_004341738.1 | Euryarchaeota
| Halosimplex carlsbadense | WP_006882195.1 | Halobacteria
| Methanocaldococcus infernus ME | YP_003616947.1 | Euryarchaeota
| Methanococcus aeolicus | ABU41683.1 | Euryarchaeota

(Continued)

Table 3 | Continued

Intein allele | Species name | Accession number | Phylum
pol-IIa (cont.) | *Methanoculleus bourgensis MS2 | YP_006544019.1 | Euryarchaeota
| Methanoculleus marisnigri JR-1 | YP_001048029.1 | Euryarchaeota
| Methanofollis liminatans | WP_004037227.1 | Euryarchaeota
| Methanolinea tarda | WP_007314808.1 | Euryarchaeota
| Methanoplanus limicola | WP_004076782.1 | Euryarchaeota
| Methanoplanus petrolearius DSM 11571 | YP_003893638.1 | Euryarchaeota
| Methanoregula boonei 6A8 | YP_001403293.1 | Euryarchaeota
| Methanoregula formicica SMSP | YP_007242862.1 | Euryarchaeota
| Methanosphaerula palustris E1-9c | YP_002467270.1 | Euryarchaeota
| Methanospirillum hungatei JF-1 | YP_503855.1 | Euryarchaeota
| *Pyrococcus horikoshii OT3 | [illegible] | Euryarchaeota
| *Thermococcus gammatolerans EJ3 | YP_002958492.1 | Euryarchaeota
| *Thermococcus sibiricus MM 739 | YP_002994988.1 | Euryarchaeota
| uncultured haloarchaeon | ABQ75865.1 | Halobacteria
| Halorubrum sp. SP3 | KJ_865692.1 | Halobacteria
| Halorubrum sp. SP7 | KJ_865690.1 | Halobacteria
| Halorubrum sp. SP9 | KJ_564691.1 | Halobacteria
pol-IIb | Halosimplex carlsbadense | WP_006887195.1 | Halobacteria
| *Pyrococcus abyssi GE5 | YP_004674494.1 | Euryarchaeota
| uncultured haloarchaeon | ABQ75865.1 | Halobacteria
gyrB | Allochromatium vinosum DSM 180 | YP_003443943.1 | Gammaproteobacteria
| Anabaena sp. 90 | YP_006997726 | Cyanobacteria
| Anabaena sp. PCC 7108 | WP_016950132.1 | Cyanobacteria
| Bacillus subtilis BEST7613 | BAM51471.1 | Firmicutes
| Calothrix sp. PCC 7103 | WP_019489451.1 | Cyanobacteria
| Coleofasciculus chthonoplastes | WP_006099284.1 | Cyanobacteria
| Cylindrospermopsis raciborskii | WP_006276716.1 | Cyanobacteria
| Dactylococcopsis salina PCC 8305 | YP_007173052.1 | Cyanobacteria
| Halarchaeum acidiphilum | WP_021780646.1 | Halobacteria
| Methanomassiliicoccus luminyensis | WP_019178436.1 | Euryarchaeota
| Microcystis aeruginosa | WP_002774451.1 | Cyanobacteria
| Moorea producens | WP_008190351.1 | Cyanobacteria
| Oscillatoria sp. PCC 10802 | WP_017715151.1 | Cyanobacteria
| Pleurocapsa sp. PCC 7319 | WP_019509077.1 | Cyanobacteria
| Prochlorothrix hollandica | WP_017710941.1 | Cyanobacteria
| Raphidiopsis brookii | [illegible] | Cyanobacteria
| Rivularia sp. PCC 7116 | YP_007054134.1 | Cyanobacteria
| Saccharothrix espanaensis DSM 44229 | YP_007037469.1 | Actinobacteria
| Synechocystis sp. PCC 6803 | NP_441040.1 | Cyanobacteria
| Trichodesmium erythraeum IMS101 | YP_723459.1 | Cyanobacteria
| uncultured bacterium | EKD46222.1 |
helicase-b | *Bacillus amyloliquefaciens TA208 | YP_005540906.1 | Firmicutes
| *Bacillus subtilis | WP_017696872.1 | Firmicutes
| Nanoarchaeota archaeon SCGC AAA011-L22 | WP_018204386.1 |
rfc-a | Methanocaldococcus jannaschii DSM 2661 | NP_248426.1 | Euryarchaeota
| Methanocaldococcus sp. FS406 | YP_003458055.1 | Euryarchaeota
| Methanothermococcus okinawensis IH1 | YP_004576337.1 | Euryarchaeota
| *Methanotorris formicicus | WP_007044297.1 | Euryarchaeota
| *Pyrococcus abyssi GE5 | NP_125803.1 | Euryarchaeota

(Continued)

Table 3 | Continued

Intein allele | Species name | Accession number | Phylum
rfc-a (cont.) | Pyrococcus furiosus DSM 3638 | [illegible] | Euryarchaeota
| Pyrococcus horikoshii OT3 | NP_142122.1 | Euryarchaeota
| Pyrococcus sp. ST04 | YP_006353924.1 | Euryarchaeota
| Thermococcus kodakarensis KOD1 | YP_184631.1 | Euryarchaeota
| Thermococcus litoralis DSM 5473 | [illegible] | Euryarchaeota
| Thermococcus sp. 4557 | YP_004763272.1 | Euryarchaeota
| Thermococcus sp. AM4 | YP_002582171.1 | Euryarchaeota
| Thermococcus sp. CL1 | YP_006425306.1 | Euryarchaeota
rpolA | Halorubrum sp. SP3 | KJ_865684.1 | Halobacteria
| Halorubrum sp. SP9 | KJ_865685.1 | Halobacteria
rir1A | Chloroherpeton thalassium ATCC 35110 | YP_001995975.1 | Chlorobi
| Tepidanaerobacter acetatoxydans Re1 | YP_007273179.1 | Firmicutes
| uncultured Chloroflexi bacterium | BAL53207.1 | Chloroflexi
rir1-k | Deinococcus peraridilitoris DSM 19664 | YP_007181218.1 | Deinococcus-Thermus
rir1-b | Acidovorax avenae subsp. avenae ATCC 19860 | [illegible] | Betaproteobacteria
| Acidovorax sp. CF316 | [illegible] | Betaproteobacteria
| Acidovorax sp. NO-1 | WP_008903130.1 | Betaproteobacteria
| Actinomadura atramentaria | [illegible] | Actinobacteria
| Alicyclobacillus pohliae | [illegible] | Firmicutes
| Aminomonas paucivorans | [illegible] | Synergistetes
| Ammonifex degensii KC4 | [illegible] | Firmicutes
| Arhodomonas aquaeolei | [illegible] | Gammaproteobacteria
| Bacillus licheniformis | [illegible] | Firmicutes
| Bacillus subtilis | [illegible] | Firmicutes
| Calothrix sp. PCC 6303 | [illegible] | Cyanobacteria
| Candidatus Chloracidobacterium thermophilum B | [illegible] | Acidobacteria
| Candidatus Desulforudis audaxviator MP104C | [illegible] | Firmicutes
| Clostridiaceae bacterium L21-TH-D2 | [illegible] | Firmicutes
| Deinococcus radiodurans R1 | [illegible] | Deinococcus-Thermus
| Delftia acidovorans | [illegible] | Betaproteobacteria
| Delftia sp. Cs1-4 | [illegible] | Betaproteobacteria
| Desulfitobacterium hafniense | [illegible] | Firmicutes
| Desulfovibrio magneticus RS-1 | [illegible] | Deltaproteobacteria
| Desulfovibrio sp. U5L | [illegible] | Deltaproteobacteria
| Ferroplasma acidarmanus fer1 | [illegible] | Euryarchaeota
| Ferroplasma sp. Type II | [illegible] | Euryarchaeota
| Halomonas anticariensis | [illegible] | Gammaproteobacteria
| Halomonas jeotgali | [illegible] | Gammaproteobacteria
| Halomonas smyrnensis | [illegible] | Gammaproteobacteria
| Mahella australiensis 50-1 BON | [illegible] | Firmicutes
| Marinobacter lipolyticus | [illegible] | Gammaproteobacteria
| Methanofollis liminatans | [illegible] | Euryarchaeota
| Methylobacter marinus | WP_020160338.1 | Gammaproteobacteria
| Methylococcus capsulatus | WP_017366201.1 | Gammaproteobacteria
| Methylomicrobium buryatense | WP_017841702.1 | Gammaproteobacteria
| nanoarchaeote Nst1 | WP_004578017.1 |
| Nocardiopsis halotolerans | WP_017572347.1 | Actinobacteria
| Polaromonas sp. JS666 | CAJ57177.1 | Cyanobacteria
| Pseudanabaena sp. PCC 6802 | WP_019499030.1 | Cyanobacteria

(Continued)

Table 3 | Continued

Intein allele | Species name | Accession number | Phylum
rir1-b (cont.) | Pseudanabaena sp. PCC 7367 | [illegible] | Cyanobacteria
| Rhodanobacter fulvus | [illegible] | Gammaproteobacteria
| Rhodanobacter sp. 2APBS1 | [illegible] | Gammaproteobacteria
| Rhodanobacter thiooxydans | [illegible] | Gammaproteobacteria
| Rhodothermus marinus SG0.5JP17-172 | [illegible] | Bacteroidetes
| Staphylococcus aureus | [illegible] | Firmicutes
| Synechococcus elongatus PCC 6301 | [illegible] | Cyanobacteria
| Synechococcus elongatus PCC 7942 | [illegible] | Cyanobacteria
| Synechococcus sp. PCC 6312 | [illegible] | Cyanobacteria
| Thermoanaerobacterium saccharolyticum JW/SL-YS485 | [illegible] | Firmicutes
| Thermoanaerobacterium thermosaccharolyticum DSM 571 | [illegible] | Firmicutes
| Thermobrachium celere | [illegible] | Firmicutes
| Thermococcus kodakarensis KOD1 | [illegible] | Euryarchaeota
| Thermodesulfatator indicus DSM 15286 | [illegible] | Thermodesulfobacteria
| Thermovirga lienii DSM 17291 | [illegible] | Deinococcus-Thermus
| Thermus igniterrae | [illegible] | Deinococcus-Thermus
| Thermus thermophilus HB8 | [illegible] | Deinococcus-Thermus
| Thioalkalivibrio sp. ALE11 | [illegible] | Gammaproteobacteria
| Thioalkalivibrio sp. ALE30 | WP_018881426.1 | Gammaproteobacteria
| Thioalkalivibrio sp. HL-Eb18 | WP_017926201.1 | Gammaproteobacteria
| Thioalkalivibrio sp. K90mix | YP_003459507.1 | Gammaproteobacteria
| uncultured bacterium | EKE25755.1 |
| Xanthomonas sp. SHU199 | WP_017907463.1 | Gammaproteobacteria
| Xanthomonas sp. SHU308 | WP_017915139.1 | Gammaproteobacteria
| zeta proteobacterium SCGC AB-604-B04 | WP_018280466.1 | Zetaproteobacteria
rir1-g | Chloroherpeton thalassium ATCC 35110 | YP_001995975.1 | Chlorobi
| Deinococcus aquatilis | [illegible] | Deinococcus-Thermus
| Halothece sp. PCC 7418 | [illegible] | Cyanobacteria
| Klebsiella pneumoniae | [illegible] | Gammaproteobacteria
| Nocardiopsis dassonvillei subsp. dassonvillei DSM 43111 | [illegible] | Actinobacteria
| Nocardiopsis sp. CNS639 | [illegible] | Actinobacteria
| Rhodothermus marinus SG0.5JP17-172 | [illegible] | Bacteroidetes
| Tepidanaerobacter acetatoxydans Re1 | YP_007273179.1 | Firmicutes
| Thermomonospora curvata DSM 43183 | YP_003299200.1 | Actinobacteria
| Thermus thermophilus HB27 | YP_005899.1 | Deinococcus-Thermus
| Thermus thermophilus HB8 | CAJ57173.1 | Deinococcus-Thermus
| Thermus thermophilus JL-18 | YP_006059430.1 | Deinococcus-Thermus
| Thermus thermophilus SG0.5JP17-16 | YP_005639869.1 | Deinococcus-Thermus
| Trichodesmium erythraeum IMS101 | [illegible] | Cyanobacteria
| uncultured Chloroflexi bacterium | BAL53207.1 | Chloroflexi
rir1-m | Thermus aquaticus | WP_003044118.1 | Deinococcus-Thermus
| Thermus thermophilus HB8 | CAJ57173.1 | Deinococcus-Thermus
| Thermus thermophilus SG0.5JP17-16 | YP_005639869.1 | Deinococcus-Thermus
| uncultured Chloroflexi bacterium | BAL53207.1 | Chloroflexi
udp | Fervidibacteria bacterium JGI 0000001-G10 | WP_020250137.1 |
| Dictyoglomus thermophilum H-6-12 | YP_002250310.1 | Dictyoglomi
| Methanocaldococcus jannaschii DSM 2661 | NP_248048.1 | Euryarchaeota
| Methanocaldococcus vulcanius M7 | YP_003246412.1 | Euryarchaeota
| Methanococcus aeolicus Nankai-3 | YP_001324612.1 | Euryarchaeota
| Methanothermococcus okinawensis IH1 | YP_004575831.1 | Euryarchaeota

(Continued)

Table 3 | Continued

Intein allele | Species name | Accession number | Phylum
udp (cont.) | Methanotorris igneus Kol 5 | WP_007044255.1 | Euryarchaeota
| Thermococcus gammatolerans EJ3 | YP_002960518.1 | Euryarchaeota
topA | Methanotorris igneus Kol 5 | WP_007044255.1 | Euryarchaeota
top6B | Halarchaeum acidiphilum | WP_021780130.1 | Halobacteria

*Indicates the intein detected is a mini-intein.
^Indicates taxa that grouped within the halobacterial intein sequences.



the cluster while the other clusters all contain intein alleles that 
are unique to certain members of the cluster, suggesting ongo- 
ing transfer of these inteins within the population. Additionally, 
three out of the twelve total clusters demonstrate unique branch- 
ing orders compared to the reference tree, though only five of the 
clusters reflected in the reference tree have identical intein pro- 
files. The lack of fixation for the intein alleles in the majority of 
clusters (seven out of twelve) indicates that a signal due to vertical 
inheritance may aid the formation of the clusters, but that HGT 
and its bias is the driving force for intein distribution. This anal- 
ysis demonstrates the utility of intein sequences in distinguishing 
a population structure amongst genomes isolated from the same 
location, as demonstrated with the genomes isolated from Aran- 
Bidgol. These relationships are made evident through analyzing 
all of the signals from each of the intein alleles represented in the 
strains, and thus represent a collapsed view of the major gene 
sharing networks that have shaped the intein profiles of these 
strains over time. The collapsed networks indicate a higher rate of 
recombination within compared to between species and groups, 
a finding similar to the sexual outcrossing in fungal populations 
where inteins also thrive, as the semi-sexual lifestyle promotes 
intein homing (Giraldo-Perez and Goddard, 2013). 
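
The distinction drawn above between clusters with identical intein profiles and clusters in which alleles are not fixed can be made explicit with a presence-absence comparison. The sketch below is illustrative only; the cluster names, strain names, and profiles are hypothetical and do not reproduce the data in this study.

```python
def fixed_alleles(profiles: dict) -> set:
    """Return the intein alleles present in every member of a cluster."""
    member_sets = list(profiles.values())
    if not member_sets:
        return set()
    return set.intersection(*member_sets)


def has_identical_profiles(profiles: dict) -> bool:
    """True when all members of the cluster carry the same set of alleles."""
    return len({frozenset(alleles) for alleles in profiles.values()}) <= 1


# Hypothetical clusters: strain name -> set of intein alleles detected.
clusters = {
    "cluster_X": {
        "strain_1": {"pol-IIa", "cdc21-a"},
        "strain_2": {"pol-IIa", "cdc21-a"},
    },
    "cluster_Y": {
        "strain_3": {"pol-IIa", "polB-a"},
        "strain_4": {"pol-IIa"},
    },
}

for name, profiles in clusters.items():
    print(name, has_identical_profiles(profiles), sorted(fixed_alleles(profiles)))
```

In this toy example, cluster_X has a fully shared profile, whereas cluster_Y carries an allele (polB-a) found in only one member, the pattern interpreted above as a sign of ongoing transfer or loss.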

It is tempting to speculate that strains that harbor an abun- 
dance of intein alleles partake in more gene transfer than their 
counterparts without as many inteins; however, these two phe- 
nomena should not be expected to have a strict correlation as 
HGT between strains that possess only one intein each cannot 
produce hybrids with more than two inteins each. The num- 
ber of inteins present in a group of different strains and species 
may be more reflective of transfers with divergent organisms than 
within-group transfer frequency. 
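
The bound in this argument can be stated as a set union: a recombinant cannot carry more distinct intein alleles than its two parents hold between them. A minimal sketch, with a hypothetical function name and made-up allele sets:

```python
def max_inteins_in_recombinant(parent_a: set, parent_b: set) -> int:
    """Upper bound on distinct intein alleles in a hybrid of two parents."""
    return len(parent_a | parent_b)


# Two parents with one (different) intein each: at most two in the hybrid.
print(max_inteins_in_recombinant({"pol-IIa"}, {"cdc21-a"}))  # 2
# Two parents sharing the same single intein: at most one in the hybrid.
print(max_inteins_in_recombinant({"pol-IIa"}, {"pol-IIa"}))  # 1
```

Under this bound, a strain can only accumulate many inteins if at least some of its transfer partners contribute alleles not already present in the group.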

The presented research demonstrates the utility of intein
sequences to follow gene flow within and between populations.
More reliable assessment of the presence and activity of the HEN
domain within an intein will provide a better distinction between
vertical and horizontal inheritance of inteins. The overall utility of
inteins improves as new intein alleles and new host proteins are 
reported, increasing the distribution of samples and improving 
statistical robustness of studies like the one done here. Prior to 
this work, nine proteins had been reported to contain inteins in 
the Halobacteria. This work established seven new intein alleles in 
the Halobacteria, including two proteins not previously reported 
to contain inteins. The presence of inteins is especially useful 
in populations where high rates of recombination and widely
distributed populations may facilitate the maintenance of intein
sequences over long periods of time (Gogarten and Hilario, 2006) 
and provide a means for distinguishing closely related partners 
involved in genetic transfers. The phylogenetic distribution of 
intein alleles, combined with the changing state within intein alle- 
les, and the rapid substitution rate of inteins relative to the extein 
host sequences (Swithers et al., 2013) will provide a valuable tool 
to infer gene flow dynamics in and between sampled populations. 